Method of text classification using discriminative topic transformation

ABSTRACT

Text is classified by determining text features from the text, and transforming the text features to topic features. Scores are determined for each class from the topic features using a discriminative topic model. The model includes a classifier that operates on the topic features, wherein the topic features are determined by the transformation from the text features, and the transformation is optimized to maximize the scores of the correct class relative to the scores of incorrect classes. Then, a class label with the highest score is selected for the text. In situations where the classes are organized in a hierarchical structure, the discriminative topic models apply to the classes at each level conditioned on the previous levels, and scores are combined across levels to evaluate the highest-scoring class labels.

FIELD OF THE INVENTION

This invention relates generally to a method for classifying text, and more particularly to classifying the text into a large number of categories.

BACKGROUND OF THE INVENTION

Text classification is an important problem for many tasks in natural language processing, such as user interfaces for command and control. In such methods, training data derived from a number of classes of text are used to optimize the parameters of a method for estimating a most likely class for the text.

Multinomial Logistic Regression (MLR) Classifiers for Text Classification

Text classification estimates a class y from an input text x, where y is a label of the class. The text can be derived from a speech signal.

In prior-art multinomial logistic regression, information about the input text is encoded using a feature function $f_{j,k}\colon (x,y) \rightarrow \{0,1\}$, typically defined such that

$f_{j,k}(x,y) = \begin{cases} 1 & \text{if } t_j \in x \text{ and } y = I_k, \\ 0 & \text{otherwise.} \end{cases}$

In other words, the feature is 1 if a term t_(j) is contained in the text x and the class label y is equal to the category I_(k), and 0 otherwise.
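
For illustration, here is a minimal Python sketch of such an indicator feature; the term and category names are hypothetical examples, not taken from the invention.

```python
# Minimal sketch of the indicator feature f_{j,k}(x, y).
# The term t_j ("game") and category I_k ("sports") are hypothetical.
def make_feature(term, category):
    """Return f(x, y): 1 if `term` occurs in text x and y equals `category`."""
    def f(x, y):
        return 1 if term in x.split() and y == category else 0
    return f

f_game_sports = make_feature("game", "sports")
print(f_game_sports("a close game last night", "sports"))  # 1
print(f_game_sports("a close game last night", "news"))    # 0
```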

A model used for the classification is a conditional exponential model of the form

$p_{\Lambda}(y \mid x) = \frac{1}{Z_{\Lambda}(x)}\, e^{\sum_{j,k} \lambda_{j,k} f_{j,k}(x,y)}, \quad \text{where} \quad Z_{\Lambda}(x) = \sum_{y} e^{\sum_{j,k} \lambda_{j,k} f_{j,k}(x,y)},$ and $\lambda_{j,k}$ and $\Lambda$ are the classification parameters.
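
As a concrete illustration, a toy sketch of evaluating this conditional exponential model follows; the classes, features, and weights are hypothetical placeholders.

```python
import math

CLASSES = ["sports", "news"]

# Hypothetical indicator features f_{j,k}, keyed by (term t_j, category I_k).
FEATURES = {
    ("game", "sports"): lambda x, y: int("game" in x.split() and y == "sports"),
    ("vote", "news"): lambda x, y: int("vote" in x.split() and y == "news"),
}
LAMBDA = {("game", "sports"): 1.5, ("vote", "news"): 0.8}  # lambda_{j,k}

def p(y, x):
    """p_Lambda(y | x) = exp(sum_{j,k} lambda_{j,k} f_{j,k}(x, y)) / Z_Lambda(x)."""
    score = lambda label: sum(LAMBDA[jk] * f(x, label) for jk, f in FEATURES.items())
    z = sum(math.exp(score(label)) for label in CLASSES)  # partition function Z
    return math.exp(score(y)) / z

print(p("sports", "a great game tonight"))  # about 0.82 in this toy setup
```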

The parameters are optimized on training pairs of texts x_(i) and labels y_(i), using an objective function

$L_{\Lambda} = \sum_{i} \left( \sum_{j,k} \lambda_{j,k} f_{j,k}(x_i, y_i) - \log \sum_{y'} e^{\sum_{j,k} \lambda_{j,k} f_{j,k}(x_i, y')} \right),$ which is to be maximized with respect to $\Lambda$.

Regularized Multinomial Logistic Regression Classifiers

Regularization terms on the classification parameters can be added in logistic regression to improve the generalization capability.

In regularized multinomial logistic regression classifiers, a general formulation using both the L1-norm and the L2-norm regularizers is

$L_{\Lambda} = \sum_{i} \left( \sum_{j,k} \lambda_{j,k} f_{j,k}(x_i, y_i) - \log \sum_{y'} e^{\sum_{j,k} \lambda_{j,k} f_{j,k}(x_i, y')} \right) - \alpha \sum_{j,k} \lambda_{j,k}^2 - \beta \sum_{j,k} |\lambda_{j,k}|,$ where

$\alpha \sum_{j,k} \lambda_{j,k}^2$ is the L2-norm regularizer,

$\beta \sum_{j,k} |\lambda_{j,k}|$ is the L1-norm regularizer, and α and β are weighting factors. This objective function is again to be maximized with respect to Λ.
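
For concreteness, a brief sketch of evaluating this regularized objective for a given parameter setting; the log-likelihood value and weights below are placeholders.

```python
# Sketch: regularized objective = log-likelihood - alpha*L2 - beta*L1.
# `loglik` would come from the exponential model; all values are placeholders.
def regularized_objective(loglik, lam, alpha, beta):
    l2 = alpha * sum(v * v for v in lam.values())  # alpha * sum lambda^2
    l1 = beta * sum(abs(v) for v in lam.values())  # beta * sum |lambda|
    return loglik - l2 - l1

lam = {("game", "sports"): 1.5, ("vote", "news"): 0.8}
print(regularized_objective(loglik=-0.20, lam=lam, alpha=0.1, beta=0.01))
```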

Various methods can optimize the parameters under these regularizations.

Topic Modeling

In the prior art, probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation (LDA) are generative topic models in which the topics are multinomial latent variables, and the distribution of topics depends on the particular document containing the text, where the words are distributed multinomially given the topics. If the documents are associated with classes, then such models can be used for text classification.

However, with generative topic models, the class-specific parameters and the topic-specific parameters are additive in the logarithmic probability domain.

SUMMARY OF THE INVENTION

The embodiments of the invention provide a method for classifying text using discriminative topic transformations. The embodiments of the invention also perform classification in problems where the classes are arranged in a hierarchy.

The method extracts features from the text, and then transforms the features into topic features, before classifying the text to determine scores.

Specifically, the text is classified by determining text features from the text, and transforming the text features to topic features. The text can be obtained from recognized speech.

Scores are determined for each class from the topic features using a discriminative topic transformation model.

The model includes a classifier that operates on the topic features, wherein the topic features are determined by the transformation from the text features, and the transformation is optimized to maximize the scores of the correct class relative to the scores of incorrect classes.

Then, a set of class labels with the highest scores is selected for the text. The number of labels selected can be predetermined, or dynamic.

In situations where the classes are organized in a hierarchical structure, where each class corresponds to a node in the hierarchy, the method proceeds as follows. The hierarchy can be traversed in a breadth-first order.

The first stage of the method evaluates the class scores of the input text at the highest level of the hierarchy (level one) using a discriminative topic transformation model trained for the level-one classes in the same way as described above. The scores for each level-one class produced by this stage are used to select a set of level-one classes having the greatest scores. For each of the selected level-one classes, the corresponding level-two child classes are then evaluated using a discriminative topic transformation model associated with each level-one class. The procedure repeats for one or more levels, or until the last level of the hierarchy is reached. The scores from each classifier used on the path from the top level to any node of the hierarchy are combined to yield a joint score for the classification at the level of that node. The scores are used to output the highest-scoring candidates at any given level in the hierarchy. The topic transformation parameters in the discriminative topic transformation models can be shared among one or more subsets of the models, in order to promote generalization within the hierarchy.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flow diagram of a text classification method and system according to embodiments of the invention, and

FIG. 2 is a flow diagram of a hierarchical text classification method and system according to embodiments of the invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

The embodiments of the invention provide a method for classifying text using a discriminative topic transformation model.

The method extracts text features $f_{j,k}(x,y)$ from the text to be classified, where j is an index for a type of feature, k is an index of a class associated with the feature, x is the text, and y is a hypothesis of the class.

The text features are transformed to topic features using $g_{l,k}(x,y) = h_l(f_{1,k}(x,y), \ldots, f_{J,k}(x,y))$, where $h_l(\cdot)$ is a function that transforms the text features, and l is an index of the topic features.

The term “topic features” is used because the features are related to semantic aspects of the text. As used in the art and herein, “semantics” relates to the meaning of the text in a natural language as a whole. Semantics focuses on the relation between signifiers, such as words, phrases, signs, and symbols, and what the signifiers denote. Semantics is distinguished from the “dictionary” meaning of the individual words.

A linear transform, $h_l(f_{1,k}(x,y), \ldots, f_{J,k}(x,y)) = \sum_j A_{l,j} f_{j,k}(x,y)$, parameterized by a feature transformation matrix A, produces the topic features

$g_{l,k}(x,y) = \sum_j A_{l,j} f_{j,k}(x,y).$
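
In matrix form, the transformation is a matrix-vector product per class; a sketch with hypothetical dimensions (J text-feature types mapped to L topics):

```python
import numpy as np

# Sketch of g_{l,k}(x, y) = sum_j A_{l,j} f_{j,k}(x, y) for a fixed (x, y, k).
J, L = 6, 2                             # hypothetical dimensions
rng = np.random.default_rng(0)
A = rng.random((L, J))                  # feature transformation matrix A
f = np.array([1.0, 0, 1, 0, 0, 1])      # f_{j,k}(x, y), j = 1..J, one class k
g = A @ f                               # topic features g_{l,k}(x, y)
print(g.shape)                          # (2,)
```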

Then, our discriminative topic transformation model is

$p_{\Lambda,A}(y \mid x) = \frac{1}{Z_{\Lambda,A}(x)}\, e^{\sum_{l,j,k} \lambda_{l,k} A_{l,j} f_{j,k}(x,y)}, \quad \text{where} \quad Z_{\Lambda,A}(x) = \sum_{y} e^{\sum_{l,j,k} \lambda_{l,k} A_{l,j} f_{j,k}(x,y)}.$

We construct and optimize our model using training text. The model includes the set of classification parameters Λ and the feature transformation matrix A. The parameters are chosen to maximize the scores of the correct class labels. The model is also used to evaluate the scores during classification. The construction can be done in a one-time preprocessing step.

The model parameters can also be regularized during optimization using various regularizers designed for the feature transformation matrix A and the classification parameters Λ.

One way uses a mixture of the L2 regularizer $\alpha \sum_{l,k} \lambda_{l,k}^2$ and the L1 regularizer $\beta \sum_{l,k} |\lambda_{l,k}|$ on the classification parameters Λ, and a combined L1/L2 regularizer

$\gamma \sum_{l} \left( \sum_{j} |A_{l,j}| \right)^2$

on the feature transformation matrix A, where α, β, and γ are weighting factors.

Objective Function for Training Model Parameters

Then, the objective function for training the model parameters Λ and A on training pairs of texts x_(i) and labels y_(i) is

$L_{\Lambda,A} = \sum_{i} \log\left( p_{\Lambda,A}(y_i \mid x_i) \right) - \alpha \sum_{l,k} \lambda_{l,k}^2 - \beta \sum_{l,k} |\lambda_{l,k}| - \gamma \sum_{l} \left( \sum_{j} |A_{l,j}| \right)^2,$ where α, β, and γ are the weights controlling the relative strength of each regularizer, and are determined using cross-validation. This objective function is to be maximized with respect to Λ and A.
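
A sketch of evaluating this objective for given parameters follows; the per-example log-probabilities, dimensions, and weights are placeholders, and the absolute values reflect the L1 and combined L1/L2 readings above.

```python
import numpy as np

# Sketch of L_{Lambda,A}: data log-likelihood minus the three regularizers.
def objective(log_p, lam, A, alpha, beta, gamma):
    data = np.sum(log_p)                                  # sum_i log p(y_i|x_i)
    l2 = alpha * np.sum(lam ** 2)                         # L2 on Lambda
    l1 = beta * np.sum(np.abs(lam))                       # L1 on Lambda
    l12 = gamma * np.sum(np.sum(np.abs(A), axis=1) ** 2)  # L1/L2 on rows of A
    return data - l2 - l1 - l12

rng = np.random.default_rng(0)
lam = rng.random((2, 3))                # lambda_{l,k}: 2 topics, 3 classes
A = rng.random((2, 6))                  # A_{l,j}: 2 topics, 6 text features
log_p = np.array([-0.1, -0.3, -0.2])    # placeholder training log-probabilities
print(objective(log_p, lam, A, alpha=0.1, beta=0.01, gamma=0.01))
```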

Scoring

Scores for each class y given a text x can be computed using a formula similar to the objective function above, leaving out the constant terms:

$s_{\Lambda,A}(y \mid x) = \sum_{l,j,k} \lambda_{l,k} A_{l,j} f_{j,k}(x,y).$
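
The triple sum can be written as a single tensor contraction; a sketch with hypothetical shapes:

```python
import numpy as np

# Sketch of s_{Lambda,A}(y | x) = sum_{l,j,k} lambda_{l,k} A_{l,j} f_{j,k}(x, y).
L, J, K = 2, 6, 3                                   # hypothetical dimensions
rng = np.random.default_rng(0)
lam = rng.random((L, K))                            # lambda_{l,k}
A = rng.random((L, J))                              # A_{l,j}
f = rng.integers(0, 2, size=(J, K)).astype(float)   # f_{j,k}(x, y), fixed x, y
score = np.einsum("lk,lj,jk->", lam, A, f)          # contract over l, j, k
print(score)
```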

Hierarchical Classification

We now consider the case where the classes are organized in a hierarchical structure. For each text x, we now have labels y^(d), d = 1, . . . , D, one for each level of the hierarchy. The label variable y^(d) at each level d takes values in a set C^(d). The set of considered values for y^(d) can be restricted to a subset C^(d)(y^(1:(d-1))) according to the values taken by the label variables y^(1:(d-1)) = y^(1), . . . , y^(d-1) at the previous levels.

For example, in the case of a tree structure for the classes, each set C^(d)(y^(1:(d-1))) can be defined as the set of children of the label y^(d-1) at level d−1.

For estimating the class at each level d, we can construct classifiers for the text that depend on the hypotheses of the classes at the previous levels d′ ≤ d−1. The score for class y^(d) is computed using the following formula:

$s_{\Lambda^{d}(y^{1:(d-1)}),\,A}\left( y^{d} \mid x, y^{1:(d-1)} \right) = \sum_{l,j,k} \lambda_{l,k}^{d}\left( y^{1:(d-1)} \right) A_{l,j} f_{j,k}\left( x, y^{d} \right),$ where Λ^(d)(y^(1:(d-1))) is the set of parameters for classes at level d given the classes at levels 1 to d−1. Optionally, the matrix A can depend on the level d and the previous levels' classes y^(1:(d-1)), but there may be advantages to having it shared across levels.

In the case of a tree representation, one possibility is to simplify the above formula to

$s_{\Lambda^{d}(y^{d-1}),\,A}\left( y^{d} \mid x, y^{d-1} \right) = \sum_{l,j,k} \lambda_{l,k}^{d}\left( y^{d-1} \right) A_{l,j} f_{j,k}\left( x, y^{d} \right),$ so that the scoring only depends on the class of the previous level.

In this framework, inference can be performed by traversing the hierarchy, and combining scores across levels for combinations of hypotheses y^(1:d).

Combining the scores across levels can be done in many ways. Here, we consider summing the scores from the different levels:

$s\left( y^{1:d} \mid x \right) = \sum_{d' \leq d} s_{\Lambda^{d'}(y^{1:(d'-1)}),\,A}\left( y^{d'} \mid x, y^{1:(d'-1)} \right).$

In some contexts, it can be important to determine the marginal score s(y^(d)|x) of y^(d). In the case of conditional exponential models, this is given (up to an irrelevant constant) by

$s\left( y^{d} \mid x \right) = \log\left( \sum_{y^{1:(d-1)}} \exp\left( s\left( y^{1:d} \mid x \right) \right) \right).$
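
This is a log-sum-exp over all label sequences reaching y^(d); a numerically stabilized sketch with placeholder path scores:

```python
import math

# Sketch: marginal score via stabilized log-sum-exp over the joint scores
# s(y^{1:d} | x) of all paths ending in the same class y^d (placeholders).
def marginal_score(path_scores):
    m = max(path_scores)
    return m + math.log(sum(math.exp(s - m) for s in path_scores))

print(marginal_score([2.0, 1.5, -0.3]))  # dominated by the best path
```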

In the case of a tree, we simply have s(y^(d)|x) = s(y^(1:d)|x), as there is only a single path that leads to y^(d).

The combined scores for different hypotheses are used to rank the hypotheses and determine the most likely classes at each level for the input text.

Traversing the hierarchy can also be done in many ways; we traverse the hierarchy from the top in a breadth-first search strategy. In this context, we can speed up the process by eliminating from consideration hypotheses y^(1:(d-1)) up to level d−1 whose scores are too low. At level d, we then only have to consider hypotheses y^(1:d) that include the top-scoring y^(1:(d-1)).
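
A toy sketch of this breadth-first traversal with pruning follows; the tree, the per-node partial scores, and the beam width are hypothetical placeholders.

```python
# Sketch of breadth-first hierarchical scoring with pruning of low-scoring
# hypotheses; the hierarchy and partial scores below are hypothetical.
TREE = {"root": ["A", "B"], "A": ["A1", "A2"], "B": ["B1"]}
SCORE = {"A": 1.2, "B": 0.1, "A1": 0.7, "A2": 0.4, "B1": 0.9}

def classify(beam=1):
    frontier = [("root", 0.0)]            # (node, combined score so far)
    per_level = []
    while frontier:
        children = [(c, s + SCORE[c])
                    for node, s in frontier for c in TREE.get(node, [])]
        if not children:
            break
        children.sort(key=lambda t: -t[1])
        frontier = children[:beam]        # keep only top-scoring hypotheses
        per_level.append(frontier)
    return per_level

print(classify(beam=1))  # [[('A', 1.2)], [('A1', 1.9)]]
```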

The hierarchy can also be represented by a directed acyclic graph (DAG). An undirected graph can be converted into a DAG by choosing a total ordering of the nodes of the undirected graph, and orienting every edge between two nodes from the node earlier in the order to the node later in the order.
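
A small sketch of that conversion, with a hypothetical node ordering and edge list:

```python
# Sketch: orient each undirected edge from the node earlier in a chosen
# total order to the node later in it; ordering and edges are hypothetical.
order = {"w": 0, "x": 1, "y": 2, "z": 3}
undirected = [("x", "w"), ("y", "x"), ("z", "y")]
dag = [(a, b) if order[a] < order[b] else (b, a) for a, b in undirected]
print(dag)  # [('w', 'x'), ('x', 'y'), ('y', 'z')] -- acyclic by construction
```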

Method

FIG. 1 shows a method for classifying text using discriminative topic transformation models according to embodiments of our invention.

As described above, we construct 105 our model 103 from known labeled training text 104 during preprocessing.

After the model is constructed, unknown unlabeled text can be classified.

Input to the method is text 101, where the text includes glyphs, characters, symbols, words, phrases, or sentences. The text can be derived from speech.

Output is a set of class labels 102 that most likely correspond to the unknown input text, i.e., class hypotheses.

Using the model, text features 111 are determined 110 from the input text 101. The text features are transformed 120 to topic features 121.

Class scores are determined 130 according to the model 103. Then, the set of class labels 102 with the highest scores is produced.

The steps of the above methods can be performed in a processor 100 connected to memory and input/output interfaces as known in the art.

FIG. 2 shows a method for classifying text using the above method in the case where the classes are arranged in a tree-structured hierarchy.

Parameters 202 are constructed according to the above method for performing classification at each level of the hierarchy. Scores for level 1 classes are evaluated 210 on unlabeled text 201 as above, producing scores for level 1 classes 203. One or more nodes in the next level 2 are then selected 220 based on the scores for level 1. Scores for the selected nodes at level 2 are again evaluated 230 using the above method on the unlabeled text 201, and are aggregated 204 with the scores for the previous level.

The same method is performed at each subsequent level of the hierarchy, beginning with the selection 220 of nodes for level i, the evaluation 230 of scores at level i, and the storage 204 of the scores up to level i.

After the scores up to the final level i = n have been aggregated, the scores are combined 240 across levels, and the set 205 of class labels for each level with the highest scores is produced.

EFFECT OF THE INVENTION

The invention provides an alternative to conventional text classification methods. Conventional methods can use features based on topic models. However, those features are not discriminatively trained within the framework of the classifier.

The use of topic features allows parameters to be shared among all classes, which enables the model to determine relationships between words across the classes, in contrast to only within each class, as in conventional classification models.

The topic features also allow the parameters for each class to be used for all classes, which can reduce noise and over-fitting during the parameter estimation, and improve generalization.

Relative to latent variable topic models, our model involves multiplication of the topic-specific and class-specific parameters in the log-probability domain, whereas the prior-art latent variable topic models involve addition in the log-probability domain, which yields a different set of possible models.

As another advantage, our method uses a multivariate logistic function whose optimization is less sensitive to training text points that are far from a decision boundary.

The hierarchical operation of the classification, combined with the discriminative topic transformations, enables the system to generalize well from training data by sharing parameters among classes. It also enables backing off to higher-level classes if inference at lower levels cannot be performed with sufficient confidence.

Although the invention has been described by way of examples of preferred embodiments, it is to be understood that various other adaptations and modifications can be made within the spirit and scope of the invention. Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the invention.

We claim:
 1. A method for classifying text, comprising steps of: acquiring text as input data in a processor, wherein the text is derived from one or more hypotheses from an automatic speech recognition system operating on a speech signal; determining text features from the text x, wherein the text features are f_(j,k)(x,y); transforming the text features to topic features, wherein the transforming is according to g_(l,k)(x,y) = h_(l)(f_(1,k)(x,y), . . . , f_(J,k)(x,y)), where j is an index for a type of feature, k is an index of a class associated with the feature, y is a hypothesis of the class label, h_(l)(•) is a function that transforms the text features, and l is an index of the topic features; determining scores from the topic features, wherein the determining steps use a model, wherein the model is a discriminative topic model comprising a classifier operating on the topic features, and the transforming is optimized to maximize the scores of a correct class relative to the scores of incorrect classes, wherein the discriminative topic model is $\max_{\Lambda,A}\left\{ \log\left( p_{\Lambda,A}(y \mid x) \right) - \alpha \sum_{l,k} \lambda_{l,k}^2 - \beta \sum_{l,k} |\lambda_{l,k}| - \gamma \sum_{l} \left( \sum_{j} |A_{l,j}| \right)^2 \right\}$, where α, β, and γ are weights, and Λ is a set of classification parameters being optimized; selecting a set of class labels with the highest scores for the text; and outputting the set of class labels to classify the text, wherein the steps are performed in the processor.
 2. The method of claim 1, wherein the topic features are a linear transformation of the text features.
 3. The method of claim 1, wherein parameters of the model are regularized using regularizers comprising L1-norm, L2-norm, and mixed-norm regularizers.
 4. The method of claim 1, wherein the topic features relate to semantic aspects of the text.
 5. The method of claim 1, wherein a linear transform h_(l)(f_(1,k)(x,y), . . . , f_(J,k)(x,y)) = Σ_(j) A_(l,j) f_(j,k)(x,y) is parameterized by a feature transformation matrix A to produce the topic features $g_{l,k}(x,y) = \sum_{j} A_{l,j} f_{j,k}(x,y).$
 6. The method of claim 1, wherein the weights are determined by cross-validation.
 7. The method of claim 1, wherein the classifying is according to semantics of a natural language used by the text.
 8. The method of claim 1, wherein the classes are organized in a hierarchical structure, wherein each class corresponds to a node in the hierarchy, wherein nodes are assigned to different levels of the hierarchy, wherein different classification parameters are used for one or more of the levels of the hierarchy, wherein classification is performed by traversing the hierarchy to evaluate partial scores of the classes at each level conditioned on hypotheses of the classes at previous levels, and combining the partial scores of the classes at one or more of the levels to determine a joint score.
 9. The method of claim 8, wherein the hierarchy is represented as a tree.
 10. The method of claim 8, wherein the hierarchy is represented as a directed acyclic graph.
 11. The method of claim 8, wherein the hierarchy is traversed in a breadth-first manner.
 12. The method of claim 8, wherein the scores at one or more levels are used to eliminate hypotheses from consideration at other levels.
 13. The method of claim 12, wherein at a given level all but the highest scoring hypotheses are eliminated from further consideration.
 14. The method of claim 12, wherein at a given level all but the n highest scoring hypotheses are eliminated from further consideration, for some positive integer n.
 15. The method of claim 8, wherein the joint score of a sequence of classes along a path from a top level to a class at another level is determined by summing the partial scores along the path.
 16. The method of claim 15, wherein the score of the class at a particular level is determined by marginalizing the joint scores of all paths leading to the class.